Vivekananda_session3_4VP20CS062_Prajwal #9

Open: wants to merge 3 commits into main

Conversation

Prajwal7Amin

No description provided.

# This function will add the entry to database
sql = """INSERT INTO members_blog (title, release_date, blog_time, author,created_date, content, recommended, html) VALUES (%s, %s::DATE, %s::TIME, %s, NOW(), %s, %s, %s)"""

with conn:

@Prajwal7Amin Why is `with` used here?

Author

@anushds The `with` statement is used here to manage the database connection. It ensures that the connection is open while the code block executes and automatically closes the connection when the block exits.


@Prajwal7Amin In your code I see that you've started a transaction (insert/truncate), but I don't see a commit for it anywhere.
Is the data actually stored in the database? If yes, how is it stored?

Author

Prajwal7Amin Jun 18, 2023

@anushds Yes, the data is stored in the database. It goes into the `members_blog` table, and each row holds the values for the defined columns such as title, date, author, content, and blog time.


@Prajwal7Amin Well, you've only partially answered my question. Could you talk about the commit I asked about?

Author

@anushds Yes, I understand it now. In my code I have not explicitly called the `commit` method on the `conn` object (`conn.commit()`), because the `with conn:` block handles the transaction and automatically commits the changes when it exits. This behavior is specified by the psycopg2 library.
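
A small sketch of the commit-on-exit behaviour described above. It uses the standard-library `sqlite3` module as a stand-in for psycopg2 (an assumption made only so the example runs without a PostgreSQL server); its connection object follows the same pattern of committing when the `with` block exits normally. The file name `blog.db` and the inserted values are hypothetical. One caveat to the earlier answer about closing: in psycopg2, as in `sqlite3`, leaving the `with conn:` block ends the transaction but does not close the connection.

```python
import os
import sqlite3
import tempfile

# Hypothetical scratch database file, so a second connection can inspect it.
db_path = os.path.join(tempfile.mkdtemp(), "blog.db")

writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE members_blog (title TEXT, author TEXT)")

# No explicit writer.commit() inside or after this block:
# the context manager commits when the block exits without an error.
with writer:
    writer.execute(
        "INSERT INTO members_blog (title, author) VALUES (?, ?)",
        ("Hello", "someone"),
    )

# A separate connection only sees data that has been committed.
reader = sqlite3.connect(db_path)
visible_rows = reader.execute("SELECT title, author FROM members_blog").fetchall()

# The writer connection is still open and usable after the with block.
row_count = writer.execute("SELECT COUNT(*) FROM members_blog").fetchone()[0]

reader.close()
writer.close()
```

Because the second connection can read the row, the insert must have been committed by the `with` block itself.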


@Prajwal7Amin Good. This behaviour comes from psycopg2 connection objects being context managers. When an error occurs while executing a query, the context manager automatically rolls back the transaction. If you're wondering why rolling back a transaction is important, try this experiment: enclose the execute line in a try-except block, make sure the query breaks, and then go on to execute the next query. Make sure to print the exception, otherwise you won't know what's going on.
Also, I want you to look into how to turn an object of a class into a context manager.
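
As a starting point for that last exercise, here is a minimal sketch of a class that works as a context manager by defining `__enter__` and `__exit__`. The `Transaction` class is hypothetical (it is not part of psycopg2 or of this PR), and `sqlite3` is again used as a stand-in so the example runs without a database server. It commits when the block succeeds and rolls back when a query fails.

```python
import sqlite3


class Transaction:
    """Hypothetical helper: commit on success, roll back on error."""

    def __init__(self, connection):
        self.connection = connection

    def __enter__(self):
        # The value returned here is what "with ... as cur" binds to.
        return self.connection.cursor()

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            self.connection.commit()
        else:
            self.connection.rollback()
        # Returning False lets the exception propagate to the caller.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members_blog (title TEXT NOT NULL)")

with Transaction(conn) as cur:
    cur.execute("INSERT INTO members_blog (title) VALUES (?)", ("kept",))

try:
    with Transaction(conn) as cur:
        cur.execute("INSERT INTO members_blog (title) VALUES (?)", ("discarded",))
        # This insert violates NOT NULL, so the whole block is rolled back.
        cur.execute("INSERT INTO members_blog (title) VALUES (?)", (None,))
except sqlite3.IntegrityError as exc:
    print("query failed:", exc)

titles = [row[0] for row in conn.execute("SELECT title FROM members_blog")]
```

Only the row from the successful block survives; the first insert of the failing block is undone together with the broken one.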

for con in contents:'''
content = post.select('.post-body')[0].text
html = 'S:\web_scapping\python_blogs.html'
with open(html, 'w', encoding='utf-8') as f:

@Prajwal7Amin You're writing to a file inside the loop. Do you think your file will contain the whole html?

Author

@anushds Yes, it contains the whole html


@Prajwal7Amin In the `open` call, the first argument is the file name. What does the second argument indicate, and why is it used here?

Author

@anushds The second argument indicates the mode in which the file is opened. In this case, 'w' is write mode.
It is used here to write the new data into the file (html = 'S:\web_scraping\python_blogs.html').


@Prajwal7Amin Alright, so you've used write mode here.
Do you know which modes are available? Could you briefly describe them?

Author

@anushds 'r' -> Read mode. Opens the file for reading and is the default mode. Raises an error if the file doesn't exist.
'w' -> Write mode. Opens the file for writing. If the file already exists, it truncates its contents. If the file doesn't exist, it creates a new file.
'x' -> Exclusive creation mode. Opens the file for writing only if it doesn't exist. If the file exists, the operation fails with an error.
'a' -> Append mode. Opens the file for writing and appends data to the end of it without truncating it. If the file doesn't exist, it creates a new file.
'b' -> Binary mode. Used together with other modes like 'r' or 'w' to handle binary files. It is commonly used for reading or writing non-text files like images or audio files.
't' -> Text mode. This is the default and is used together with other modes like 'r' or 'w' to handle text files. File contents are read and written as strings.
'+' -> Update mode. Opens a file for updating (reading and writing). Used together with other modes, such as 'r+', 'w+', or 'a+'.
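
A short sketch that exercises the modes listed above on a scratch file. The file name `demo.txt` and the written strings are hypothetical; only built-in `open` and standard-library helpers are used.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# 'r' on a file that does not exist raises FileNotFoundError.
try:
    open(path, "r")
    missing_raises = False
except FileNotFoundError:
    missing_raises = True

# 'w' creates the file, and truncates it on every later open.
with open(path, "w", encoding="utf-8") as f:
    f.write("first")
with open(path, "w", encoding="utf-8") as f:
    f.write("second")

# 'a' keeps the existing contents and writes at the end.
with open(path, "a", encoding="utf-8") as f:
    f.write(" third")

# 'x' refuses to open a file that already exists.
try:
    open(path, "x")
    exclusive_raises = False
except FileExistsError:
    exclusive_raises = True

# Text mode returns str, binary mode ('b') returns bytes.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
with open(path, "rb") as f:
    raw = f.read()
```

Note that "first" is gone after the second `'w'` open, which is the same effect as reopening a file in write mode on every loop iteration.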


@Prajwal7Amin Given your explanation of write mode, and since you're writing to the file inside the loop, let me ask the same question again: does your file contain the whole HTML?

Author

Prajwal7Amin Jun 20, 2023

@anushds Thank you for noticing. Yes, you were right, the file does not contain the whole HTML. The `python_blogs.html` file contains only the HTML content from the last iteration, that is, the final webpage that was scraped.

Author

Prajwal7Amin Jun 20, 2023

@anushds Should I use append mode 'a' here instead of write mode?
Is that right? Could you help me with this?


@Prajwal7Amin Good that you realised.
There are multiple approaches to this:

  1. Store the HTML contents of each iteration in a variable (say `html_contents`). After the loop ends, write this to the file.
  2. Yes, you can open the file in append mode and then write to it. But I wouldn't suggest that, because opening and closing the file on every iteration is an expensive operation (given the number of calls made to the OS). Sure, the difference will be a matter of seconds, but even these extra seconds spent on this operation matter from a production perspective.
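
A minimal sketch contrasting the original pattern with approach 1. The `pages` list is a hypothetical stand-in for the HTML scraped on each iteration, and the output file names are made up for the example; `html_contents` is the variable name suggested above.

```python
import os
import tempfile

# Hypothetical stand-in for the HTML scraped in each iteration.
pages = ["<p>page 1</p>", "<p>page 2</p>", "<p>page 3</p>"]
out_dir = tempfile.mkdtemp()

# Original pattern: reopening in 'w' mode truncates the file every time,
# so only the last page survives.
last_only_path = os.path.join(out_dir, "last_only.html")
for page in pages:
    with open(last_only_path, "w", encoding="utf-8") as f:
        f.write(page)

# Approach 1: collect everything first, then open the file once.
all_pages_path = os.path.join(out_dir, "all_pages.html")
html_contents = []
for page in pages:
    html_contents.append(page)
with open(all_pages_path, "w", encoding="utf-8") as f:
    f.write("\n".join(html_contents))

with open(last_only_path, encoding="utf-8") as f:
    last_only = f.read()
with open(all_pages_path, encoding="utf-8") as f:
    all_pages = f.read()
```

A related option, if the pages are too large to hold in memory, is to open the file once before the loop and call `f.write` inside it; that also avoids reopening the file on every iteration.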

2 participants